Title of the Invention 

METHOD OF SPEAKER NORMALIZATION FOR SPEECH RECOGNITION 
USING FREQUENCY CONVERSION AND SPEECH RECOGNITION APPARATUS 
APPLYING THE PRECEDING METHOD 

Field of the Invention 

This invention relates to a speaker normalization method 
for adjusting the utterance diversity arising from speaker 
differences by processing input acoustic feature parameters, 
and to a speech recognition apparatus applying the same method. 

Background of the Invention 

A speech recognition apparatus using a speaker 
normalization method as described in JP-A-2001-2*>5&S6 is 
conventionally known. In that speech recognition apparatus, A/D 
conversion is first applied to digitize the input speech 
utterances, and feature parameters such as LPC cepstrum 
coefficients are extracted. Then, the boundary between voiced 
and unvoiced speech is determined to detect the voiced and 
unvoiced speech segments. Then, in order to normalize the 
effect of the individual differences between utterances, which 
come from the diversity of vocal tract length among speakers, 
the obtained feature parameters, such as the LPC cepstrum, are 
converted along the frequency axis. 

Then, matching is made between the feature parameters of the 
input utterance converted on the frequency axis and the 
acoustic-model feature parameters previously learned from the 
training utterances of a large number of speakers, to compute at 
least one recognition result candidate. Thereafter, the optimal 
conversion coefficient is determined by using the input 
utterance as a teacher signal, on the basis of the computed 
recognition result. In order to cancel the variations between 
speakers or utterances, the frequency conversion coefficients 
are smoothed and then updated into new frequency conversion 
coefficients. The updated coefficients are then used to repeat 
the matching with the acoustic-model feature parameters. 
Through this series of steps, a recognition candidate is 
finally obtained for use as a recognition result. 

Meanwhile, JP-A-2002-189492 describes a speech 
recognition apparatus using a technique to expand and contract 
the input utterances along their spectral frequency. This art 
deduces phoneme boundary information for each utterance, and 
thereby deduces a frequency expansion/contraction condition 
based on the phonemic segments derived from the phoneme boundary 
information. 

However, these conventional methods have the drawback 
that a subject-of-recognition word lexicon is needed to carry 
out speaker normalization. These methods also require detailed 
information, obtained by detection or deduction, about the 
boundaries of phonemes, the voiced and unvoiced areas, and the 
voiced area inside each utterance. 

Summary of the Invention 

The present invention is for solving the conventional 
problem, and it is an object to implement a speaker 
normalization procedure without using a subject-of-recognition 
word lexicon and without deducing or detecting segment 
information or phonemes, thereby correcting for the individual 
differences of input utterances and improving speech 
recognition performance. 

A method of speaker normalization of the present 
invention comprises: a feature parameter extracting step of 
segmenting an input speech utterance into frames of constant 
time length and computing one or one set of acoustic feature 
parameters for each frame; a frequency converting step of 
converting the one or the one set of acoustic feature 
parameters along the frequency axis by using plural frequency 
conversion coefficients previously defined; a step of using all 
combinations of the plural converted feature parameter sets 
obtained by the frequency conversion and one or more standard 
phonemic models to compute more than one similarity or distance 
between the converted feature parameter sets of each of the 
frames and the standard phonemic model; a step of deciding a 
frequency converting condition for normalizing the input 
utterance by using more than one of the similarities or 
distances; and a step of normalizing the input utterance by the 
decided frequency converting condition. 

Meanwhile, an apparatus for speech recognition of the 
invention comprises: a feature parameter extracting section for 
segmenting an input speech utterance into frames of constant 
time length and extracting one or one set of acoustic feature 
parameters from each of the frames; a frequency converting 
section for converting the acoustic feature parameters along 
their frequency axis by using more than one frequency 
conversion coefficient previously defined; a similarity or 
distance computing section for using all combinations of the 
converted feature parameters obtained by the frequency 
conversion and the standard phonemic model to compute the 
similarities or distances between the post-conversion features 
of each frame and the standard phonemic model; a frequency 
conversion condition deciding section for fixing a frequency 
converting condition to normalize the input utterance along its 
frequency axis by using the similarities or distances; and a 
speech-recognition processing section for recognizing an input 
utterance with intended lexicons and intended acoustic models; 
whereby the input utterance is normalized by using the 
determined frequency conversion condition, thereby effecting 
speech recognition. 

Thus, by normalizing an input utterance in this manner so 
that it matches the acoustic feature parameters of the standard 
speaker as previously explained, the differences between input 
utterances caused by speaker diversity are normalized without 
using a subject-of-recognition word lexicon, thereby improving 
the recognition performance. 

Brief Description of the Drawings 

Fig. 1 is a block diagram showing the hardware of a speech 
recognition system according to embodiment 1 of the present 
invention; 

Fig. 2 is a functional block diagram showing a functional 
configuration of the speech recognition system according to 
embodiment 1 of the invention; 

Fig. 3 is a flowchart showing a process of the speech 
recognition system according to embodiment 1 of the invention; 

Fig. 4 is a functional block diagram showing a functional 
configuration of a speech recognition system according to 
embodiment 2 of the invention; 

Fig. 5 is a flowchart showing a process of the speech 
recognition system according to embodiment 2 of the invention; 

Fig. 6 is a functional block diagram showing a functional 
configuration of a speech recognition system according to 
embodiment 3 of the invention; 

Fig. 7 is a flowchart showing a process of the speech 
recognition system according to embodiment 3 of the invention; 

Fig. 8A is a relationship figure between phoneme and 
conversion coefficient in each frame according to embodiment 
1 of the invention while Fig. 8B is a relationship figure between 
conversion coefficient and frequency according to embodiment 
1 of the invention; 

Fig. 9A is a relationship figure between phoneme and 
conversion coefficient according to embodiment 2 of the 
invention while Fig. 9B is a relationship figure between 
selected phoneme and conversion coefficient according to 
embodiment 2 of the invention; 

Fig. 10A is a relationship figure between phoneme and 
weight in each frame according to embodiment 3 of the invention 
while Fig. 10B is a relationship figure between conversion 
coefficient and weight according to embodiment 3 of the 
invention; 

Fig. 11A is a figure showing a result of speech 
recognition according to embodiment 1 of the invention, Fig. 
11B is a figure showing a result of speech recognition according 
to embodiment 2 of the invention, and Fig. 11C is a figure showing 
a result of speech recognition according to embodiment 3 of the 
invention; 

Fig. 12 is a block diagram showing the function of an 
integrated speech remote control for home-use appliances 
according to embodiment 4 of the invention; and 

Fig. 13 is a figure showing a display screen of a display 
device according to embodiment 4 of the invention. 

Description of the Exemplary Embodiment 

Exemplary embodiments of the present invention are 
demonstrated hereinafter with reference to the accompanying 
drawings. 

1. First Exemplary Embodiment 

Fig. 1 is a block diagram showing the hardware of a speech 
recognition system using speaker normalization according to the 
first embodiment of the present invention. In Fig. 1, a 
microphone 101 captures a speech utterance, and an A/D converter 
102 converts the analog signal of the utterance into a digital 
signal. A serial converter (hereinafter referred to as "SCO") 
103 forwards the serial signal from the A/D converter 102 onto 
a bus data line 112. A storage device 104 stores a standard 
speaker group phonemic model (hereinafter referred to as 
"standard phonemic model"), which is a group of numerals 
obtained by statistically processing the phoneme-based feature 
parameters previously learned from the utterances of plural 
speakers, and a word model obtained by connecting 
half-syllable-fragment models, each of which is a group of 
numerals obtained by statistically processing the 
half-syllable-fragment-based feature parameters previously 
learned from the plural speakers' utterances. 

A parallel IO port (hereinafter referred to as "PIO") 105 
outputs a standard phonemic model or word model from the storage 
device 104 onto the bus line 112 synchronously with a bus clock, 
and outputs a speech recognition result onto an output unit 110 
such as a display. A RAM 107 is a temporary storing memory for 
executing data processing. A DMA controller (hereinafter 
referred to as "DMA") 106 controls the high-speed data transfer 
between the storage device 104, the output unit 110 and the RAM 
107. 



A ROM 108 is written with a process program and preset 
data, such as the transform coefficients for frequency 
conversion referred to later. The SCO 103, the PIO 105, the DMA 
106, the RAM 107 and the ROM 108 are connected through the bus 
and placed under the control of a CPU 109. The CPU 109 can be 
replaced with a digital signal processor (DSP). 

The elements from the SCO 103 to the CPU 109 make up a speech 
recognition apparatus 100. 

Now, the functional block configuration of the 
hardware-configured speech recognition apparatus 100 shown in 
Fig. 1 is explained using Fig. 2. 

A feature parameter extracting section 201 extracts an 
acoustic feature parameter or acoustic feature parameters 
obtained from time-divided data of the input utterance SIG1. 
The input utterance SIG1 is digital data, and its sampling 
frequency varies with the speech A/D system in use, e.g. 8 kHz 
for telephone speech and 44.1 kHz for CD audio applications. 
The sampling frequency of the present embodiment 1 

Equation 1 



α in Equation 1 is referred to as a frequency conversion 
coefficient (hereinafter referred to as "conversion 
coefficient"). Although the conversion coefficient α is by 
nature a continuous variable, the present embodiment 1 used 
seven discrete values α1 to α7, i.e. '-0.15', '-0.10', '-0.05', 
'0', '+0.05', '+0.10' and '+0.15', for the convenience of 
processing. These are hereinafter referred to as a conversion 
coefficient group. 
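
By way of illustration only, the conversion coefficient group can be 
applied to a frequency axis as sketched below. The sketch assumes that 
the conversion behaves like a first-order all-pass (bilinear) warping, 
a form commonly used for vocal-tract-length normalization; that choice 
and the name `warp_frequency` are assumptions made here and are not 
taken from Equation 1. 

```python
import math

# The seven discrete conversion coefficients named in the text.
ALPHAS = [-0.15, -0.10, -0.05, 0.0, 0.05, 0.10, 0.15]


def warp_frequency(omega, alpha):
    """Warp a normalized angular frequency omega (0..pi) by alpha.

    Assumption: first-order all-pass (bilinear) warping, used here only
    as a stand-in for the conversion of Equation 1.
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


# Show where the mid-band frequency pi/2 lands for each coefficient.
for alpha in ALPHAS:
    print(alpha, round(warp_frequency(math.pi / 2, alpha), 4))
```

Under this assumed warping, a coefficient of 0 leaves the axis 
unchanged, the end points 0 and π stay fixed, and positive and 
negative coefficients shift the mid-band in opposite directions. 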

A frequency converting section 202 performs a frequency 
conversion process using the conversion coefficient set in it, 
according to Equation 1. A conversion-coefficient setting 
section 203 sets the frequency converting section 202 with 
plural conversion coefficients. A similarity (meaning a degree 
of similarity) or distance computing section 204 reads 
standard phonemic model data from a standard phonemic model 205, 
and computes a similarity or distance thereof to each of the 
plural input acoustic feature parameters after conversion 
(hereinafter referred to as "post-conversion feature 
parameter") with the plural conversion coefficients, obtained 
from the frequency converting section 202. The similarity or 
distance in this embodiment is detailed later. Meanwhile, the 
computation result is stored in a result storing section 206. 

The standard phonemic model 205 comprises a group of 
numerals resulting from statistically processing the feature 
parameters of the following 24 phonemes: 

/a/, /o/, /u/, /i/, /e/, /j/, /w/, /m/, /n/, /ng/, /b/, /d/, /r/, 
/z/, /hv/, /hu/, /s/, /c/, /p/, /t/, /k/, /yv/, /yu/, /n/. 

The selection of the phonemes is described in The IEICE (in 
Japan) Transactions on Information and Systems, PT.2 (Japanese 
Edition) u-ii wo. u pp. auy* - pp. ;uu3. 

A word model 210 represents a subject-of-recognition 
word obtained by connecting half-syllable-fragment models, and 
corresponds to one example of a subject-of-recognition standard 
acoustic model. The standard phonemic model 205 and the word 
model 210 are both stored in the storage device 104. Both are 
trained with the same utterance set of the same standard speaker 
group by the use of a statistical process. 

A conversion-condition determining section 207 
determines a conversion condition for use in speech recognition 
from the results stored in the result storing section 206. 

A feature-parameter storing section 208 is a memory for 
temporarily storing the feature parameters extracted in the 
feature-parameter extracting section 201 until the speech 
recognition process is completed. Part of the RAM 107 is 
allocated to store them. 

A speech-recognition processing section 209 computes a 
similarity or distance between a frequency-converted feature 
parameter and a word model 210, to thereby determine a word. 
Meanwhile, the recognition result is outputted to an output unit 
110. 

The operation of the speech recognition apparatus 100 
thus functionally configured is explained by using the 
flowchart shown in Fig. 3. 

At first, the feature-parameter extracting section 201 
extracts a seven-dimensional LPC mel-cepstrum coefficient 
vector as an acoustic feature parameter, frame by frame, from 
the utterance inputted through a microphone 101 and then changed 
to a digital signal through the A/D converter 102 (step S301). 
The extracted feature parameter is outputted to the frequency 
converting section 202 and simultaneously stored to the 
feature-parameter storing section 208. 

Then, the conversion-coefficient setting section 203 
sets the frequency converting section 202 with a predetermined 
conversion coefficient. The frequency converting section 202 
performs a frequency conversion on the acoustic feature 
parameter with this conversion coefficient, according to 
Equation 1, thereby determining a post-conversion feature 
parameter. The conversion is made with all the conversion 
coefficients of the conversion coefficient group. Hence the 
number of converted feature parameters for each frame is the 
same as the number of conversion coefficients included in the 
conversion coefficient group (step S302). 

The similarity or distance computing section 204 
compares one set of the converted feature parameters with all 
phonemes of the standard phonemic model read out of the standard 
phonemic model 205. This comparison can use either of two 
methods: comparing single frames, or comparing plural frames 
by adding several preceding and succeeding frames. In the 
embodiment 1, the similarity or distance computation uses a 
width of 7 frames, formed by adding the 3 preceding and the 3 
succeeding frames to the frame in focus, to calculate the 
similarity or distance between the input data and the standard 
phonemic model included in the standard phonemic model 205 
(step S303). 
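
The 7-frame comparison width can be sketched as follows, purely as 
an illustration. The function name `stack_context` and the handling 
of the first and last frames (the edge frame is repeated) are 
assumptions; the text does not say how the ends of the utterance are 
treated. 

```python
def stack_context(frames, index, width=3):
    """Concatenate the frame at `index` with `width` frames on each side.

    Positions before the first or after the last frame are filled by
    repeating the edge frame (an assumption, not stated in the text).
    """
    last = len(frames) - 1
    stacked = []
    for offset in range(-width, width + 1):
        j = min(max(index + offset, 0), last)
        stacked.extend(frames[j])
    return stacked


# Ten frames of 7-dimensional features give one 49-dimensional
# vector for each frame in focus.
frames = [[float(t)] * 7 for t in range(10)]
window = stack_context(frames, 5)
print(len(window))
```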

The result is stored to the result storing section 206. 
Incidentally, the similarity or distance computing section 204 
performs the similarity or distance computation on all 
the computed post-conversion feature parameters. 

As methods of computing a similarity or distance 
between a converted feature parameter and a standard phonemic 
model, there are a method of using a similarity obtained by 
phonemic recognition with a statistically processed model 
having a distribution as the utterance model of the standard 
speaker group, and a method of using a physical distance with a 
phoneme-based representative value as the utterance model of 
the standard speaker group. However, a similar effect is 
available even when using another similarity or distance 
measure. 

Now, two examples are explained of the standard phonemic 
model 205 in which the phonemes for use in speaker normalization 
are modeled. 

The first example is a case of using a similarity obtained 
by phoneme recognition with a statistically processed model 
having a distribution as the utterance model of the standard 
speaker group. In this case, the Mahalanobis generalized 
distance is used as a measure to determine a similarity for 
phoneme recognition, wherein the acoustic feature parameters 
of 7 successive frames are collected from the utterance part 
corresponding to each phoneme of the standard speaker 
utterances, and a mean value and a covariance matrix are 
computed and converted into coefficient vectors. 
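
As an illustrative sketch of the first example, the squared 
Mahalanobis distance between a feature vector and a phoneme model 
given by a mean vector and a covariance matrix can be computed as 
below. The helper names are hypothetical, small two-dimensional 
vectors stand in for the stacked 7-frame vectors, and the conversion 
into coefficient vectors mentioned above is not modeled. 

```python
def mean_and_covariance(samples):
    """Return the mean vector and the sample covariance matrix."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    return mean, cov


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^T cov^-1 (x - mean)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    return sum(d * y for d, y in zip(diff, solve(cov, diff)))
```

A smaller distance corresponds to a higher similarity to the phoneme 
model, so the most similar phoneme is the one with the minimum value. 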

The second example is a case of using a physical distance 
by adopting a phoneme-based representative value as the 
utterance model of the standard speaker group. This is 
configured by a group of mean vectors of the acoustic feature 
parameters in 7 successive frames of the utterance part 
corresponding to each phoneme in the standard speakers' 
utterances. 
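
A minimal sketch of the second example follows. It assumes that the 
physical distance is the Euclidean distance, which the text does not 
specify, and the representative vectors shown are made-up values for 
illustration. 

```python
import math


def euclidean(x, v):
    """Euclidean distance between an input vector and a representative vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, v)))


def nearest_phoneme(x, representatives):
    """Return the phoneme whose representative vector is nearest to x."""
    return min(representatives, key=lambda ph: euclidean(x, representatives[ph]))


# Hypothetical phoneme-based representative values (mean vectors).
representatives = {"a": [1.0, 0.0], "i": [0.0, 1.0]}
print(nearest_phoneme([0.9, 0.2], representatives))
```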

Incidentally, the Mahalanobis generalized distance is 
explained in JP-A-60-67996, for example. 

The results of the two cases, i.e. the case of using the 
phonemic recognition similarity and the case of using the 
distance to the phoneme-based typical value, are described 
later. 

The data stored in the result storing section 206 must 
be, for each input frame and each of the 24 phonemes, either a 
distance to the phoneme-based selected value (a representative 
model) or a likelihood of phonemic recognition. 

The steps S301 to S303 are executed on all the frames in 
the speech segment. 

Then, the conversion-condition determining section 207 
determines a conversion coefficient candidate of the highest 
similarity to the phoneme within the input frame according to 
Equation 2 (step S304). 

$$\hat{\alpha} = \arg\max_{\alpha} L(X^{\alpha} \mid \alpha, \theta)$$ 

Equation 2 

In Equation 2, L expresses the similarity, X^α the spectrum 
given by frequency conversion according to Equation 1, α the 
conversion coefficient and θ the standard phonemic model. A 
conversion coefficient α is searched for and decided which 
maximizes the similarity between a spectrum X^α and a standard 
phonemic model θ. This embodiment 1, using seven discrete 
values α1 to α7 for the convenience of processing, selects and 
decides the conversion coefficient α at which the highest 
similarity is obtainable from among the similarities in the 
respective cases to which all the seven discrete values are 
applied. Namely, the similarities obtained from applying the 
seven discrete values are mutually compared, to select the 
conversion coefficient α at which the highest similarity is 
obtainable. 

In the case that comparing the phonemic feature parameters 
results in a distance, the conversion coefficient giving the 
nearest distance is determined according to Equation 3. 

$$\hat{\alpha} = \arg\min_{\alpha} D(X^{\alpha} \mid \alpha, \theta)$$ 

Equation 3 

In Equation 3, D represents the distance, X^α the spectrum 
given by frequency conversion according to Equation 1, α the 
conversion coefficient and θ the standard phonemic model. A 
conversion coefficient α is searched for and decided which 
minimizes the distance between a spectrum X^α and a standard 
phonemic model θ. This embodiment selects and decides the 
conversion coefficient α at which the smallest, or nearest, 
distance is obtainable from among the distances in the 
respective cases to which all the seven discrete values are 
applied. Namely, the distances obtained from applying the seven 
discrete values are mutually compared, to select the conversion 
coefficient α at which the smallest distance is obtained. 

Then, the phoneme highest in similarity or smallest 
in distance to the input is selected frame by frame, to determine 
a conversion coefficient that brings the input nearest to that 
phoneme of the standard phonemic model (step S305). Fig. 8A is 
a figure showing the phoneme-based conversion coefficients on 
all the frames, illustrating this status. In Fig. 8A, the 
maximum-likelihood conversion coefficient 801 is selected for 
each phoneme within the frame, and the maximum-likelihood 
phoneme 802 is determined by computing a similarity or 
distance. Then, a conversion coefficient 803 corresponding to 
the relevant phoneme is determined. For example, in the case 
that step S305 determines that the maximum likelihood in the 
first frame is obtained under the condition of a phoneme /a/ 
and conversion coefficient α4, the conversion coefficient α4 
used in that frequency conversion is given as the conversion 
coefficient for the first frame. 
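
The per-frame selection of steps S304 and S305 can be sketched as 
follows for the distance case. The function name, the nested 
dictionary layout and the numeric values are hypothetical; with a 
similarity measure the two minimum searches would become maximum 
searches. 

```python
def select_frame_coefficient(distances):
    """Pick the phoneme and conversion coefficient for one frame.

    `distances[phoneme][alpha]` holds the distance between the frame
    converted with `alpha` and the model of `phoneme`.
    Returns (phoneme, alpha, distance) with the smallest distance.
    """
    best = None
    for phoneme, by_alpha in distances.items():
        # Step S304: per phoneme, the coefficient giving the minimum distance.
        alpha = min(by_alpha, key=by_alpha.get)
        candidate = (by_alpha[alpha], phoneme, alpha)
        # Step S305: keep the phoneme nearest to the input frame.
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best[1], best[2], best[0]


# Made-up distances for a single frame, two phonemes, three coefficients.
frame_distances = {
    "a": {-0.05: 2.0, 0.0: 1.2, 0.05: 0.4},
    "e": {-0.05: 0.9, 0.0: 1.1, 0.05: 1.5},
}
print(select_frame_coefficient(frame_distances))
```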

Then, the conversion-condition determining section 207 
cumulatively stores, over the entire speech segment, the 
occurrence frequency of the frequency converting condition 
corresponding to the selected phoneme for each frame 
determined in the step S305. Then, the stored occurrence 
frequencies are compared with each other to determine the 
conversion coefficient of highest occurrence frequency as the 
frequency converting condition for the entire segment, and the 
section notifies it to the conversion-coefficient setting 
section 203 (step S306). Fig. 8B is a figure showing the 
relationship between the conversion coefficients and the 
cumulative frequency. In Fig. 8B, the conversion coefficient 
having the greatest frequency is given as the frequency 
converting condition. 
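
The counting of step S306 amounts to a majority vote over the 
per-frame coefficients, which can be sketched as below. The function 
name is hypothetical, and how ties are broken is an assumption (the 
coefficient seen first wins); the text does not address ties. 

```python
from collections import Counter


def vote_segment_coefficient(frame_alphas):
    """Return the most frequent per-frame coefficient and all counts."""
    counts = Counter(frame_alphas)
    alpha, _ = counts.most_common(1)[0]
    return alpha, counts


# Made-up per-frame decisions for a five-frame segment.
alpha, counts = vote_segment_coefficient([0.0, 0.05, 0.05, -0.05, 0.05])
print(alpha, counts[alpha])
```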

By the above steps S301 to S306, a frequency conversion 
coefficient for use in a speech recognition process is 
determined. According to the steps S301 to S306, one conversion 
coefficient is selected for frequency conversion for each input 
frame. However, because there are differences between the 
conversion coefficients selected based on each input frame, 
speaker normalization can be implemented more finely based on 
each input frame. Thus, any utterance input can be normalized 
with respect to speaker-based differences. 

Then, the conversion-coefficient setting section 203 
sets the notified conversion coefficient to the frequency 
converting section 202. After this, the frequency 
converting section 202 reads the stored feature parameters out 
of the feature-parameter storing section 208, and carries out a 
frequency conversion over the entire speech segment starting 
from the first frame (step S307). The converted feature 
parameters resulting from that procedure are outputted to the 
speech-recognition processing section 209. 

These steps S301 to S307 are for the processing of speaker 
normalization. Because this process normalizes the input 
utterance in a manner matched to the standard speaker, the input 
utterance is normalized for its speaker-based differences, 
thereby improving recognition performance. 

Then, the speech-recognition processing section 209 
carries out a speech recognition process using the converted 
feature parameters. For this processing, a method using a 
hidden Markov model, a method with dynamic time warping, a 
method with neural networks, and others are known. The present 
embodiment 1 used a speech recognition method disclosed in 
JP-A-4-369696, JP-A-5-150797 and JP-A-6-266393. The 
speech-recognition processing section 209 carries out a speech 
recognition process by the use of the input and the word model, 
and outputs a recognized word as a speech recognition result to 
the output unit 110 (step S308). 

As described above, the present embodiment 1 determines 
a frequency converting condition using the similarities 
or distances of all the 24 phonemes, which are considered to be 
sufficient for speech recognition. Using this speaker 
normalization can improve the recognition performance 
for every speech utterance that can be inputted to the speech 
recognition apparatus. 

The step S306 of this embodiment 1 cumulatively stored 
the number of occurrences of frequency converting conditions 
for all selected phonemes, but it is also possible to count and 
store the occurrences only when the selected phoneme is a vowel. 
This procedure determines the frequency converting condition 
for the entire segment from the information of vowels only, 
which are the most reliable subject of frequency conversion. 
Hence the determined frequency converting condition can be made 
more reliable. 
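
The vowel-only variant can be sketched as a filtered vote. The 
function name is hypothetical, and treating /a/, /i/, /u/, /e/ and 
/o/ from the 24-phoneme list as the vowels is an assumption made for 
this illustration. 

```python
from collections import Counter

# Assumption: these five phonemes of the 24-phoneme set are the vowels.
VOWELS = {"a", "i", "u", "e", "o"}


def vote_on_vowels(frame_decisions):
    """Vote over (phoneme, alpha) pairs, counting vowel frames only.

    Returns None when no frame was decided as a vowel.
    """
    counts = Counter(alpha for phoneme, alpha in frame_decisions
                     if phoneme in VOWELS)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up per-frame decisions: consonant frames are ignored in the vote.
decisions = [("a", 0.05), ("k", 0.10), ("k", 0.10), ("k", 0.10),
             ("i", 0.05), ("e", 0.0)]
print(vote_on_vowels(decisions))
```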

Fig. 11A shows the results of speech recognition with and 
without speaker normalization according to the present 
embodiment 1. This test was conducted with 100-word utterances 
by three speakers who are not included among the speakers used 
to train the acoustic model, using a word lexicon having 100 
word entries. Speaker normalization improved the recognition 
rate by 7 to 21%. This confirms that the above effect is 
obtainable even when speaker normalization is conducted with 
continuing-length-fixed phoneme recognition, without using a 
subject-of-recognition word lexicon and without segment 
detection of voiced and unvoiced sound, in computing a distance 
between the input and the standard phonemic model. 

Incidentally, the present embodiment 1 determines a 
conversion coefficient adapted over the entire speech segment 
after making a frequency conversion process over the entire 
speech segment. However, it is also possible to adopt a 
conversion coefficient as the one for the entire speech segment 
at the time point when it has been selected as a frequency 
converting condition a predetermined number of times. This can 
reduce the time of speech recognition. 
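
This early decision can be sketched as follows. The function name 
and the `threshold` parameter are hypothetical, and falling back to 
the most frequent coefficient when the threshold is never reached is 
an assumption. 

```python
from collections import Counter


def early_decision(frame_alphas, threshold):
    """Stop as soon as one coefficient has been selected `threshold` times.

    Returns (alpha, frames_used). If the threshold is never reached, the
    most frequent coefficient over all frames is returned (an assumption).
    """
    counts = Counter()
    for used, alpha in enumerate(frame_alphas, start=1):
        counts[alpha] += 1
        if counts[alpha] >= threshold:
            return alpha, used
    if not counts:
        return None, 0
    return counts.most_common(1)[0][0], len(frame_alphas)


# With a threshold of 3, the decision is reached after four frames.
print(early_decision([0.05, 0.0, 0.05, 0.05, 0.10, 0.10], 3))
```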

2. Second Exemplary Embodiment 

Fig. 4 shows a functional configuration of a speech 
recognition apparatus according to a second embodiment of the 
invention. This is different from the first embodiment in that 
a similarity or distance computing section 204 compares, with 
a standard phonemic model 205, an acoustic feature parameter 
outputted from a feature-parameter extracting section 201 
in addition to an output from a frequency converting section 
202. There is a further difference in that a 
conversion-condition determining section 207 determines a 
conversion condition by using the result for a representative 
phoneme, referred to later, from among the results obtained 
from the similarity or distance computing section 204 and 
stored in a result storing section 206. 

Now, the speech recognition operation of the present 
embodiment 2 is explained using Figs. 4 and 5. The former-half 
process of steps S301 to S304 in Fig. 5 is similar to that 
of the steps of the embodiment 1 explained in Fig. 3, wherein 
the conversion-condition determining section 207 determines a 
phoneme-based frequency converting condition for each frame. 

Then, the conversion-condition determining section 207 
cumulatively stores the occurrence frequency of the 
frequency-conversion conditions decided for each phoneme in 
the step S304 (step S501). Fig. 9A is one example of a figure 
showing the relationship between a phoneme and a conversion 
coefficient generated as a result of this process. Meanwhile, 
the conversion-condition determining section 207 selects the 
conversion coefficient of highest frequency for each phoneme, 
and decides it as the conversion coefficient of the phoneme for 
the entire speech segment (step S502). Fig. 9A shows that one 
conversion coefficient is selected for the phoneme /a/ while 
another is selected for the phoneme /e/. 
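
Steps S501 and S502 can be sketched as a separate vote for each 
phoneme. The function name and the per-frame dictionary layout are 
hypothetical, and the numeric values are made up for illustration. 

```python
from collections import Counter, defaultdict


def per_phoneme_coefficients(frame_choices):
    """Pick the most frequent coefficient for each phoneme.

    `frame_choices` holds one dict per frame that maps each phoneme to
    the coefficient decided for it in step S304.
    """
    counts = defaultdict(Counter)
    for choice in frame_choices:
        for phoneme, alpha in choice.items():
            counts[phoneme][alpha] += 1      # step S501: accumulate
    # Step S502: highest-frequency coefficient per phoneme.
    return {ph: c.most_common(1)[0][0] for ph, c in counts.items()}


frame_choices = [
    {"a": 0.05, "e": -0.05},
    {"a": 0.05, "e": 0.0},
    {"a": 0.0, "e": -0.05},
]
print(per_phoneme_coefficients(frame_choices))
```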

At the same time, the conversion-condition determining 
section 207 decides a representative phoneme for each input 
frame, over the entire segment of input frames (step S503). In 
this embodiment, the similarity or distance computing section 
204 compares an output of the feature-parameter extracting 
section 201 with each standard phonemic model stored in the 
standard phonemic model 205, and the phoneme with the highest 
similarity among the similarities stored in the result storing 
section 206, or with the minimum distance to the phoneme-based 
representative value, is selected as the representative 
phoneme. 

Meanwhile, the conversion-condition determining section 
207 selects the conversion coefficient corresponding to the 
representative phoneme of the input frame, depending upon the 
decision in the step S502. This process is made over the entire 
segment of input frames, and the result is notified to the 
conversion-coefficient setting section 203 (step S504). Fig. 
9B is one example of a figure showing the relationship between 
the representative phoneme of every frame and the corresponding 
conversion coefficient. 
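
Step S504 is then a lookup from each frame's representative phoneme 
to that phoneme's coefficient, as sketched below. The function name 
is hypothetical, and using 0 (no conversion) for a phoneme without a 
decided coefficient is an assumption. 

```python
def per_frame_coefficients(representative_phonemes, phoneme_alpha, default=0.0):
    """Map each frame's representative phoneme to its coefficient.

    `default` is used when a phoneme has no decided coefficient
    (an assumption; the text does not cover this case).
    """
    return [phoneme_alpha.get(ph, default) for ph in representative_phonemes]


# Representative phoneme of each frame (step S503) and the per-phoneme
# coefficients decided in step S502; both are made-up values.
phoneme_alpha = {"a": 0.05, "e": -0.05}
print(per_frame_coefficients(["a", "a", "e", "k"], phoneme_alpha))
```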

Then, the conversion-coefficient setting section 203 
sets the frequency converting section 202 with the adaptive, 
notified conversion coefficient for each input frame. The 
frequency converting section 202 in turn reads the stored 
feature parameters out of the feature-parameter storing 
section 208, and carries out a frequency conversion process for 
delivery to the speech-recognition processing section 209 
(step S505). This process is carried out over the entire speech 
segment. 

The above steps S301 to S505 are for the processing of 
speaker normalization in the present embodiment 2. The 
subsequent speech-recognition processing step S308 is 
identical to the speech-recognition processing step S308 
explained with Fig. 3 in the embodiment 1. 

As described above, the present embodiment 2 selects one 
conversion coefficient for carrying out a frequency conversion 
on each input frame. Because the conversion coefficient is 
selected for each input frame one by one, speaker 
normalization can be effected finely, frame by frame. Any 
speech utterance can be inputted to the speech recognition 
apparatus using the speaker normalization, thus improving the 
performance of recognition. 

Fig. 11B shows the results of speech recognition according 
to the present embodiment 2 in the respective cases in which 
speaker normalization is carried out and not carried out. This 
test was conducted with 100-word input utterances by nine 
speakers who are not included among the speakers used to train 
the acoustic model, using a word lexicon having 100 word 
entries. Speaker normalization improved by 8.2% the recognition 
rate of the children, which had been lower than that of adults. 
This confirms that the above effect is obtainable even when a 
speaker normalization condition is determined by using a result 
of continuing-length-fixed phoneme recognition or of a 
distance computation between an input and a standard 
phonemic model, without segment detection of voiced and 
unvoiced sound, and without carrying out a recognition process 
using a subject-of-recognition word lexicon. 

3. Third Exemplary Embodiment 

Fig. 6 shows a functional configuration of a speech 
recognition apparatus according to a third embodiment of the 
invention. This is different from the second embodiment in that 
there is provided a phoneme-weighting computing section 601 for 
computing a weight of each phoneme from a feature parameter. 

Now, the operation of speech recognition of embodiment 
3 is explained using Figs. 6 and 7. The former-half process 
of steps S301 to S502 is similar to that of Fig. 5 explained 
in the second embodiment, i.e. the conversion-condition 
determining section 207 determines a frequency converting 
condition for each phoneme. 

The conversion-condition determining section 207 
determines phoneme weights, frame by frame, for the entire 
segment of input speech (step S701). For determining the 
weights, a similarity or distance computing section 204 
computes a similarity degree between an output of the 
feature-parameter extracting section 201 and each 
standard phonemic model of the standard phonemic model 205, or a 
distance thereof to a phoneme-based representative value. The 
computed distance is stored in a result storing section 206. 
Thereafter, the conversion-condition determining section 207 
determines a normalized weight by using Equation 4. 

In Equation 4, W_ik represents the weight, X the input 
spectrum, V the phoneme-based representative value vector, k 
the phoneme kind, p the parameter representative of a smoothness 
of interpolation, and d(X, V) the distance between an input 
spectrum and a phoneme-based representative value as determined 
according to Equation 5. 



W_{ik} = \frac{d(X_i, V_k)^{-p}}{\sum_{k'} d(X_i, V_{k'})^{-p}}     Equation 4 

d(X, V) = |X - V|^{2}     Equation 5 
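
As an illustration of how a normalized phoneme weight could be computed, the sketch below assumes that Equation 4 is an inverse-distance weight raised to the power p and normalized over all phonemes, and that Equation 5 is a squared Euclidean distance. Both forms are assumptions made for this example, since the printed equations are not fully legible. The function names and the small `eps` guard against division by zero are likewise hypothetical.

```python
def squared_distance(x, v):
    """Assumed form of Equation 5: squared Euclidean distance."""
    return sum((xi - vi) ** 2 for xi, vi in zip(x, v))

def phoneme_weights(x, representatives, p=1.0, eps=1e-12):
    """Assumed form of Equation 4: inverse-distance weights that sum to 1.

    x               -- input spectrum of one frame
    representatives -- dict mapping phoneme kind k to its vector V_k
    p               -- smoothness parameter of the interpolation
    eps             -- hypothetical guard for a zero distance
    """
    inverse = {k: (squared_distance(x, v) + eps) ** (-p)
               for k, v in representatives.items()}
    total = sum(inverse.values())
    return {k: w / total for k, w in inverse.items()}

# Toy example: distances to "a", "i", "u" are 1, 1 and 4 respectively.
reps = {"a": [0.0, 0.0], "i": [2.0, 0.0], "u": [1.0, 2.0]}
weights = phoneme_weights([1.0, 0.0], reps, p=1.0)
print(weights)
```

Under these assumptions a larger p concentrates the weight on the nearest phoneme, while a smaller p spreads it more evenly.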

The conversion-condition determining section 207 carries 
out the above process over the entire speech segment, to compute 
a phoneme-based weight on each frame. As a result of the 
computation, obtained is a relationship between a phoneme of 
each frame and a phoneme-based weight, as shown in Fig. 10A. 
This result is recorded in the result storing section 206. 

Then, a phoneme-weight computing section 601 computes a 
conversion-coefficient-based weight of each frame, from the 
relationship between each phoneme and the corresponding 
frequency converting condition over the entire speech segment 
determined in the step S502 (see Fig. 8A) and the relationship 
between a phoneme of each frame and a phoneme-based weight 
determined in the step S701 (see Fig. 10A) (step S702). Fig. 
10B shows this relationship. Then, the phoneme-weight computing 
section 601 stores the computation result in the result storing 
section 206. 
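
One plausible reading of step S702 is that, for each frame, the weights of all phonemes sharing the same conversion coefficient are added together. The sketch below illustrates that reading. The accumulation rule, the function name `coefficient_weights` and the toy values are assumptions, and entries with zero weight are dropped because only coefficients with a weight other than "0" are notified in the following step.

```python
from collections import defaultdict

def coefficient_weights(frame_phoneme_weights, phoneme_to_alpha):
    """Turn per-frame phoneme weights into per-frame coefficient weights.

    frame_phoneme_weights -- list (one entry per frame) of {phoneme: weight}
    phoneme_to_alpha      -- {phoneme: conversion coefficient} from step S502
    """
    result = []
    for weights in frame_phoneme_weights:
        per_alpha = defaultdict(float)
        for phoneme, w in weights.items():
            per_alpha[phoneme_to_alpha[phoneme]] += w
        # Keep only coefficients whose accumulated weight is non-zero.
        result.append({a: w for a, w in per_alpha.items() if w != 0.0})
    return result

# Hypothetical values: "a" and "u" happen to share one coefficient.
phoneme_to_alpha = {"a": 0.05, "i": -0.02, "u": 0.05}
frame_weights = [
    {"a": 0.5, "i": 0.25, "u": 0.25},
    {"a": 0.0, "i": 1.0, "u": 0.0},
]
per_frame = coefficient_weights(frame_weights, phoneme_to_alpha)
print(per_frame)
```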

Then, the conversion-condition determining section 207 
reads the conversion-coefficient-based weight of each frame out 
of the result storing section 206, and notifies, frame by frame, 
the conversion-coefficient setting section 203 of the 
conversion coefficient having a weight other than "0". The 
conversion-coefficient setting section 203 sets the frequency 
converting section 202 with the notified conversion coefficient. 
The frequency converting section 202 again carries out a 
frequency conversion starting at the first frame by the use of 
the conversion coefficients, and outputs a post-conversion 
feature parameter to the similarity or distance computing 
section 204 (step S703). 

Then, the speech-recognition processing section 209 
reads a relationship between a conversion coefficient and a 
weight of each frame from the result storing section 206, and 
multiplies a weight corresponding to the conversion coefficient 
on the conversion coefficient obtained in the step S704. This 
process is made sequentially on all the conversion coefficients 
notified from the conversion-condition determining section 207, 
followed by summing up those (step S704). This computation can 
be carried out according to Equation 6. 



X' = \sum_{l} w_l \, X(\alpha_l)     Equation 6 

In Equation 6, X is the feature parameter of an input 
utterance, X' is the post-conversion feature parameter, \alpha_l is 
the conversion coefficient and w_l is the weight. 
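
The weighted summing-up of step S704 can be illustrated as below, reading Equation 6 as a weighted sum of the input feature converted with each coefficient. The frequency conversion used here, a linear rescaling of the frequency axis with linear interpolation, is only a hypothetical stand-in chosen to keep the example short. It is not the conversion defined in the specification, and `warp_spectrum` and `weighted_conversion` are illustrative names.

```python
def warp_spectrum(spectrum, alpha):
    """Hypothetical stand-in for frequency conversion.

    Rescales the frequency axis by (1 + alpha) and resamples the spectrum
    with linear interpolation, clamping at the band edges.
    """
    n = len(spectrum)
    warped = []
    for i in range(n):
        pos = min(max(i * (1.0 + alpha), 0.0), n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append((1.0 - frac) * spectrum[lo] + frac * spectrum[hi])
    return warped

def weighted_conversion(spectrum, alpha_weights):
    """Weighted sum over coefficients of the converted spectrum of one frame."""
    out = [0.0] * len(spectrum)
    for alpha, w in alpha_weights.items():
        for i, value in enumerate(warp_spectrum(spectrum, alpha)):
            out[i] += w * value
    return out

# Toy frame: a ramp spectrum combined with two equally weighted coefficients.
frame = [0.0, 1.0, 2.0, 3.0, 4.0]
converted = weighted_conversion(frame, {0.0: 0.5, 0.5: 0.5})
print(converted)
```

When a single coefficient carries the whole weight, the result reduces to the one-coefficient conversion of the second embodiment.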

The above steps S301 to S704 are for the processing of 
speaker normalization. The subsequent speech recognition 
process step S308 is similar to the speech recognition process 
step S308 of Fig. 3 explained in the embodiment 1. 

The above process of the steps S703 to S308 is carried 
out over the entire speech segment. 

As described above, in the present embodiment 3, the 
conversion coefficient for frequency-converting the spectrum 
of each input frame is selected in plurality to make a weighted 
summing-up process, wherein the weight set value is different 
between input frames. Consequently, speaker normalization can 
be accurately implemented frame by frame. Any speech utterance 
can thus be inputted to the speech recognition apparatus using 
the speech normalization, improving the performance of 
recognition. 

Meanwhile, because the weight is determined by using feature 
parameters before frequency conversion, it is possible to prevent 
the frequency conversion from having a double effect during 
frequency conversion. Thus, the effect can be suppressed low for 
the speaker utterance the frequency conversion of which tends to 
act toward the worse. 

Fig. 11C shows a result of speech recognition according 
to the present embodiment 3 in the respective cases in which 
speaker normalization is carried out and not carried out. 
This test was conducted with 100-word input by nine speakers 
who are not included in the speakers used to train the acoustic 
model, using a word lexicon having an entry of 100 words. Speaker 
normalization improved by 9.2% the recognition rate of the 
children, which had been lower than that of the adults. 

This confirms that the above effect is obtainable even 
in case a speaker normalization condition is determined by using 
a result of continuation-length-fixed phoneme recognition in 
the absence of segment detection of voiced and unvoiced sound, 
or of distance computation between an input and a standard 
phonemic model, without carrying out a recognition process 
using a subject-of-recognition word lexicon. 

Meanwhile, the present embodiment, although explained for 
the effect of speaker normalization in case of recognizing words, 
is similarly applicable to recognising sentences or 
conversation speech. 

4. Fourth Exemplary Embodiment 

Fig. 12 is a block diagram showing the function of an 
integrated speech remote-control unit for home-use appliances 
according to a fourth embodiment of the invention. 

A start-up switch 121 instructs a microphone 101 to start 
capturing a speech utterance, in order for the user to start 
up the integrated speech remote-control unit for home-use 
appliances. A switch 122 is for the user to input to a speech 
recognition apparatus 100 an instruction of whether speaker 
normalization is to be made or not. A display unit 123 displays 
to the user whether or not speaker normalization is in process 
in the speech recognition apparatus. A remote-control 
signal generator unit 124 receives a speech recognition result 
(SIG4) from an output unit 110 and outputs an infrared-ray 
remote-control signal (SIG5). An electronic appliance group 
125 receives the infrared-ray remote-control signal (SIG5) from 
the remote-control signal generator unit 124. 

Incidentally, it is possible to make a configuration not 
including the start-up switch 121. In such a case, the 
configuration may be such that the microphone 101 captures a 
speech utterance at all times and sends speech data to an A/D 
converter 102 at all times, or the microphone 101 observes the 
change of power so that, when an increment in a constant time 
exceeds a threshold, handling is effected similarly to the case 
there is an instruction from the start-up switch 121. The 
operation of the microphone 101, A/D converter 102, storage 
device 104 and output unit 110 is similar to the operation of 
Fig. 1, and the explanation is omitted herein. 

In the below is explained a case that a speech recognition 
apparatus 100 of the present embodiment 4 uses the speech 
recognition apparatus explained in the embodiment 3. Note that 
it is possible to use any of the speech recognition apparatuses 
explained in the embodiments 1 to 3. 

In the integrated speech remote-control unit for home-use 
appliances of the present embodiment 4, the user is allowed to 
select whether or not to carry out speaker normalization 
depending upon an input to the switch 122. The switch 122 has 
one button, to switch over whether or not to carry out speaker 
normalization each time it is depressed. The instruction due 
to depressing the switch 122 is notified to the speech 
recognition apparatus 100. When speaker normalization is not 
carried out, the fact is notified to a frequency converter 
section 202 provided in the speech recognition apparatus 100, 
to change the process to output a feature parameter without 
making a frequency conversion process. The situation of whether 
speaker normalization is being carried out or not is displayed 
on the display unit 123. Accordingly, the user can always grasp 
the situation in a simple way. The start-up switch 121 also has 
one button. During a constant time after the user depresses 
the start-up switch 121 in order to start a speech recognition, 
the microphone 101 captures a speech utterance at all times and 
continuously delivers it to the A/D converter 102. The A/D 
converter 102 also continuously delivers digitized 
utterance data to the speech recognition apparatus 100. 

After the user depresses the start-up switch 121, in the 
case the power of an input utterance continuously exceeds a 
preset threshold for 1 second or longer and then becomes smaller 
than the threshold, the utterance by the user is considered 
ended and the microphone 101 halts the capture of utterance. 
The time value of 1 second exceeding the threshold is a mere 
example. This can be changed by setting the microphone 101, 
depending upon a length of words to be recognized. Conversely, 
in the case 3 seconds elapse with little variation 
in the utterance power, the user's speech input is considered halted 
to cease speech capture. The time up to halting speech capture 
may be 5 seconds or 2 seconds, i.e. it may be changed by setting 
the microphone 101 depending upon the situation the apparatus 
is used. In case the microphone 101 halts the speech capture 
process, the process of the A/D converter 102 and subsequent 
is ceased. The speech utterance data thus captured is rendered 
a subject of speech recognition process in the speech 
recognition apparatus 100, and a result obtained is outputted 
to the output unit 110. 
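
The capture-halting rule described above can be sketched as a small state machine over per-frame power values. The sketch makes several assumptions that are not in the specification: power is already available as one value per fixed-length frame, "little variation in the utterance power" is interpreted as the power staying at or below the threshold, and the names `find_capture_end`, `frame_sec`, `min_speech_sec` and `timeout_sec` are illustrative.

```python
import math

def find_capture_end(frame_powers, frame_sec, threshold,
                     min_speech_sec=1.0, timeout_sec=3.0):
    """Return (frame_index, reason) where capture halts, or (None, None).

    reason is "utterance_ended" when the power stayed above the threshold
    for at least min_speech_sec and then dropped, or "timeout" when the
    power stayed at or below the threshold for timeout_sec (assumed reading).
    """
    min_frames = math.ceil(min_speech_sec / frame_sec)
    timeout_frames = math.ceil(timeout_sec / frame_sec)
    run = 0    # consecutive frames above the threshold
    quiet = 0  # consecutive frames at or below the threshold
    for i, power in enumerate(frame_powers):
        if power > threshold:
            run += 1
            quiet = 0
        else:
            if run >= min_frames:
                return i, "utterance_ended"
            run = 0
            quiet += 1
            if quiet >= timeout_frames:
                return i, "timeout"
    return None, None

# Toy input with 0.5-second frames and a threshold of 1.0.
ended = find_capture_end([0, 0, 5, 5, 5, 0], frame_sec=0.5, threshold=1.0)
silent = find_capture_end([0] * 6, frame_sec=0.5, threshold=1.0)
print(ended, silent)
```

Both durations are parameters because the text notes that the 1-second and 3-second values are examples that may be changed to suit the words to be recognized and the situation of use.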

For example, in the case that the user desires to put on a 
lighting by the integrated speech remote-control unit for 
home-use appliances in a state the switch 122 is pushed in, in 
case giving an utterance "lighting" in a state the start-up 
switch 121 is depressed, an utterance is captured through the 
microphone 101 and converted into a digital signal in the A/D 
converter 102, then being sent to the speech recognition 
apparatus 100. The speech recognition apparatus 100 carries 
out a speech recognition process. 

In the example of this embodiment 4, the storage device 
104 is previously stored with such words as "video recorder", 
"lighting", "electricity" and "television" as 
subject-of-recognition words correspondingly to the electronic 
appliance group 125 as a subject of operation. In case the speech 
recognition apparatus 100 has a recognition result "lighting", 
the result is forwarded as SIG3 to the output unit 110. The 
output unit 110 outputs an output SIG4 corresponding to the 
remote-control signal. The output unit 110 holds the information 
about a relationship between a recognition result by the speech 
recognition apparatus 100 and the electronic appliance group 
125 to be actually controlled. For example, in either case the 
output from the SIG3 is "lighting" or "lamp", conversion is made 
as a signal to a lighting appliance 126 of the electronic 
appliance group 125 whereby the information about the lighting 
appliance 126 is forwarded as SIG4 onto the remote-control 
signal generator unit 124. 
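
The relationship held between recognition results and appliances can be pictured as a simple lookup table. The following is a hypothetical sketch: the dictionary layout, the `to_control_info` function and the `"power_toggle"` command label are assumptions, and the word "electricity" is left out because the text does not say which appliance it controls.

```python
# Hypothetical table relating recognized words (SIG3) to target appliances.
WORD_TO_APPLIANCE = {
    "lighting": "lighting appliance",
    "lamp": "lighting appliance",
    "video recorder": "video recorder",
    "video": "video recorder",
    "television": "television",
}

def to_control_info(recognized_word):
    """Map a recognition result (SIG3) to control information (SIG4).

    Returns None when the word has no entry in the table.
    """
    appliance = WORD_TO_APPLIANCE.get(recognized_word)
    if appliance is None:
        return None
    return {"appliance": appliance, "command": "power_toggle"}

print(to_control_info("lamp"))
```

Several words can point at the same appliance, which is how "lighting" and "lamp" both reach the lighting appliance.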

The remote-control signal generator unit 124 converts the 
content information received as SIG4, representing a control 
signal of a to-be-controlled appliance, into an infrared-ray 
remote-control signal, and then outputs it as SIG5 to the 
electronic appliance group 125. The remote-control signal 
generator unit 124 is configured to issue an infrared-ray 
remote-control signal over a broad range, to issue a signal 
simultaneously to all the indoor appliances capable of receiving 
an infrared-ray remote-control signal. Because an on/off 
toggle signal is sent by the SIG5 to the lighting appliance 126, 
putting on/off the lighting appliance can be carried out in a 
manner according to the user's speech. In the case that the 
electronic appliance group 125 placed under control of turning 
on/off power is a video recorder 127, the word "video" spoken 
by the user is recognized. In the case of the television 128, 
the word "television" is recognized to effect similar control. 

It is assumed that the integrated speech remote-control 
unit for home-use appliances of the embodiment 4 is installed 
within a household in a set state in which nearly 100 words are 
recognizable, wherein the household comprises only adult men 
and women. Even if the user sets the switch 122 not to make 
speaker normalization, the probability to put 
on/off lighting according to an utterance "lighting" can be 98% 
or higher provided that the speaker is an adult man or woman, 
as shown in Fig. 11C. However, in the case the speaker is a 
child, recognition is as low as nearly 84% without speaker 
normalization. It is generally considered that, where 
recognition performance can be secured at 90% or higher, the user 
would consider "the apparatus operates accurately to utterance". 
However, in the case of 84%, it would be considered as an 
"apparatus not perfectly but substantially operable to 
utterance". On the other hand, in case speaker 
normalization is carried out as indicated by the switch 122, 
a recognition rate of 93% is obtainable even if the speaker is a 
child. Thus, "the apparatus is operable to utterance" for the child. 

The situation of speaker normalization is displayed on the 
display unit 123 and hence quite obvious for the user. In order 
to show the speaker normalization process clearly, the 
display device 123 may make a display of character display 1301, 
e.g. "Readjust Voice Now In Process Not In Process" 
representative of making speaker normalization, as shown in Fig. 
13. When speaker normalization is being carried out, "Now In 
Process" may be displayed with emphasis. When speaker 
normalization is not being carried out, "Not In Process" may 
be displayed with emphasis. In Fig. 13, because speaker 
normalization is under proceeding, the area "Now In Process" 
is changed in display color for emphasis. 

Meanwhile, the parameter weights on seven discrete values 
α1 to α7 for frequency conversion determined in the speech 
recognition apparatus 100, if displayed on a weight display 
graph 1302, provide a more explicit display. 

Although the present embodiment 4 showed the case that 
speaker normalization is used on the integrated speech 
remote-control unit for home-use appliances, the present 
embodiment 4, operable on the user side only by making a selection 
as to whether or not to make a speaker normalization and giving 
an instruction to start a speech recognition, is similarly 
applicable, particularly, to such an appliance in which the 
user may change without notice, such as a street guide terminal 
unit capable of speech operation, and an appliance of coin 
telephone capable of speech operation. 

Incidentally, where speaker normalization is made at all 
times, the switch 122 may be removed in the configuration. In 
this case, the user can use the unit in a simple way because of 
only instructing to start speech recognition. 

The speaker normalization method and speech recognition 
apparatus using the same of the invention are useful for a speech 
control unit, such as an integrated speech remote-control unit for 
home-use appliances, a street guide terminal unit capable of 
speech operation, and an appliance of coin telephone capable of 
speech operation, where there is exchange of user without notice. 